Abstract

Background: Microsatellite (MS) markers have become an important tool for studying the population diversity, 
evolutionary history and multiplicity of infection (MOI) of malaria parasite infections. MS are typically selected on 
the basis of being highly polymorphic. However, it is known that the polymorphic potential (mutability) of each 
marker can vary as much as two orders of magnitude, which radically changes how diversity is represented in the 
genome from one marker to the next. Over the past decade, approximately 240 Plasmodium vivax MS have been 
published, comprising nine major panels of markers. Inconsistent usage of each panel has resulted in a surfeit of 
descriptive genetic diversity data that are largely incomparable between populations. The objective of this study 
was to statistically evaluate the quality of individual MS markers in order to validate a refined panel of markers that 
will provide a balanced picture of P. vivax population diversity. 

Methods: All previously published data, including genetic diversity indices, MS parameters, and population 
parameters, were assembled from 18 different global studies into a flat file to facilitate statistical analysis and 
modelling using JMP® Genomics 6.0 (SAS Institute Inc, Cary, NC, USA). Statistical modelling was employed to
down-select markers with extreme variation among the mean number of alleles, expected heterozygosity, 
maximum repeat length and/or chromosomal location of the repeat. Individual MS were analysed by step-down 
whole model linear regression and standard least squares fit models, both stratified by annual parasite incidence 
to identify MS markers with values significantly different from the mean. 

Results: Of the 42 MS under evaluation in this study, 18 (nine high priority) were identified as ideal candidates for 
measuring population diversity between global regions, while five (two high priority) additional markers were 
identified as candidates for MOI studies. 

Conclusions: MS diversity was found to be a function of endemicity and motif structure. Evaluation of individual 
MS permitted the assembly of a refined panel of markers that can be reliably utilized in the field to compare 
population structures between global regions. 
Background

Microsatellite (MS) DNA sequences are short tandem repeats,
typically composed of motifs of one (mono-) to six (hexa-)
nucleotides, which repeat continuously without
interruption (perfect repeat type), with intermittent
nucleotide disruption (imperfect repeat type), with
interrupting insertions (interrupted repeat type) or in tandem
with a different motif (compound repeat type). MS are generated
and maintained by mutation events, such as replication
slippage and/or slip-strand mismatch repair, which induce
sequence length variation through expansions/insertions
and contractions/deletions of the repeating motif(s)
[1-4]. Regardless of repeat type, the total number of re- 
peats in the MS is referred to as the repeat length. Vari- 
ation in the repeat length causes size polymorphisms 
within the locus, which can be used to differentiate organisms
in population diversity studies [5,6]. Given their
mechanisms of mutation, MS are often considered selectively
neutral. However, this assumption is debated because MS are
scattered throughout both intergenic and intragenic regions
of most chromosomes; the location of each locus should
therefore be considered before interpreting data under an
assumption of neutrality. Although MS lack
the strain diversity resolution that whole genome sequen- 
cing provides, these markers remain an effective and easily 
deployable method for high-throughput genotyping in the 
field at moderate cost. Compared with single nucleotide 
polymorphism (SNP) genotyping, MS can provide in- 
creased resolution due to a higher polymorphic potential 
(i.e., more alleles per locus), but can be problematic to
interpret, standardize and calibrate across multiple studies.
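
As a minimal illustration of the terminology above, the following Python sketch counts the longest uninterrupted run of a motif in a sequence, i.e., the repeat length of a perfect repeat. The function name and the example sequences are hypothetical and are not taken from any of the published marker panels.

```python
def longest_perfect_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    if not motif:
        raise ValueError("motif must be a non-empty string")
    best = 0
    step = len(motif)
    for start in range(len(sequence)):
        count = 0
        pos = start
        # Walk forward one motif at a time while the motif keeps matching.
        while sequence.startswith(motif, pos):
            count += 1
            pos += step
        best = max(best, count)
    return best

# Hypothetical alleles of one tri-nucleotide (ATA) locus; flanks are invented.
allele_a = "GGC" + "ATA" * 5 + "TTG"
allele_b = "GGC" + "ATA" * 8 + "TTG"
# An interrupted repeat: a single inserted base splits the array into 3 + 4.
allele_c = "GGC" + "ATA" * 3 + "C" + "ATA" * 4 + "TTG"
```

In this toy example, alleles a and b differ by three motif copies (nine nucleotides), which is the kind of size polymorphism scored in fragment analysis, whereas the interruption in allele c shortens the longest perfect run even though the locus is longer overall.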

Since the introduction of MS in population diversity 
studies, great insight has been gained into the amount of 
observed and expected genetic diversity within extant pop- 
ulations of eukaryotic parasites [7-16]. For malaria para- 
sites, MS have rapidly become a popular alternative to 
polymorphic antigenic genes due to their purported neu- 
trality, ubiquity throughout genomes and utility for de- 
scribing the evolutionary history of global populations. 
Furthermore, the relatively unconstrained polymorphic na- 
ture of MS loci permits increased detection of multiclonal 
infections [17], which can be useful when describing the 
history of endemicity and the stability of transmission 
within a specific global region [18,19]. For Plasmodium 
vivax, the utility of these markers may even extend to
describing infection dynamics across time, e.g., whether an
individual is presenting with a relapse, recrudescence or
reinfection [12,20-22]. One of the major objectives in Plas- 
modium global diversity studies is to generate data that 
can be compared between populations of differing geog- 
raphies, ecologies, climates, endemicities, and transmission 
intensities; however, such studies require standardizing ex- 
perimental and analytical methods across a large and geo- 
graphically separated community of researchers [23,24]. 



Unlike the frequently used Plasmodium falciparum 
MS marker panel published by Anderson et al. [8], there
are approximately nine different panels of P. vivax MS 
markers (including two panels with minisatellites with 
motifs that exceed six nucleotides) [12,25-32], describing 
at least 240 loci scattered throughout the genome. The 
majority of these MS markers were identified in silico 
and their polymorphic nature tested on DNA from refer- 
ence strains [12,25-32]. However, in the last decade there 
have been at least 22 studies investigating P. vivax MS 
population diversity across seven global regions, 17 
countries, and at 47 different field sites. Of the markers 
utilized in these studies (N = 68), only 42 have been 
tested in more than one field site, and seven of these are 
second-generation versions of a previously published 
marker, which results in moderately redundant popula- 
tion diversity data. Consequently, there are many sets of 
descriptive data that remain largely incomparable, owing 
to minimal genetic marker overlap between studies. 

In most studies, microsatellites are selected on the basis 
of being highly polymorphic. However, it is known that 
the polymorphic potential of each marker can vary as 
much as two orders of magnitude, which radically changes 
how diversity is represented in the genome from one 
marker to the next [33-35]. The objective of this study was 
to statistically evaluate the quality of the MS markers cur- 
rently in use, in order to generate a refined panel of 
markers that will provide a balanced picture of P. vivax 
genomic diversity. A statistically validated P. vivax MS 
panel would provide at least two benefits to the P. vivax 
community. First, statistical evaluation provides a means 
of assessing marker suitability at the outset of a study, for 
the purpose of describing population structure and multi- 
plicity of infection (MOI). The inherent mutability of the 
repeat region is not easily assessed in the absence of long- 
term in vitro culture, which is not routine for P. vivax 
parasites due to their strict preference for reticulocytes. 
However, the quality of MS markers can be evaluated sta- 
tistically by investigating the association between diversity 
level and endemicity, as well as the repeat length [36-43],
motif length, repeat type and location of the tandem re- 
peat. The second benefit derived from the use of a stan- 
dardized panel is the ability to compare population 
parameters, such as diversity and structure between global 
regions, which is a basic premise of population genetics 
studies. Before these benefits can be realized, the current 
MS marker panels must be re-evaluated and if possible 
consolidated to permit a more comprehensive and com- 
parative approach to P. vivax population diversity across 
global populations. 

Analyses resulted in a standardized panel of 18 (nine 
high priority) high-quality MS markers distributed across 
nine chromosomes. These markers are ideal for popula- 
tion diversity studies, as they will reliably describe overall 
population structure as a function of endemicity, while
also accommodating a wide range of polymorphic vari- 
ation. Additionally, a panel of five (two high priority) 
highly polymorphic MS markers was identified for MOI 
studies. These markers consistently exceed the predicted
diversity level within different global regions and, possibly
owing to increased mutability, are suitable for describing
infections with more than one clone. Standardized usage of
these panels will facilitate a clearer understanding of the 
history of this parasite as it has evolved in different eco- 
logical and epidemiological niches. 

Methods 

Microsatellite marker selection 

Of the ~240 MS markers that have been described in the
literature, 42 were selected for this study because each had 
been used in more than one field study, and therefore 
could be compared. These 42 MS markers were verified 
against the reference genomes [25,44], tested for redun- 
dancy against all published MS loci, located in the genome 
(intergenic or intragenic), and identified by repeat type 
(perfect or non-perfect, which includes all repeat types that 
are not deemed perfect). Of the 42 MS markers, seven 
were found to be second-generation versions of a previ- 
ously published marker (first-generation), which had either 
been redesigned to optimally capture the repeat region 
or were unknowingly duplicated during the discovery 
stage (NCBI Primer Blast). In most studies, the second- 
generation marker was used in the same study as the first- 
generation marker, permitting a direct comparison among 
genetic diversity indices. In all cases, variation between 
first- and second-generation markers was not significant. As
a result, only data from the first-generation markers were
utilized in this study; however, second-generation markers
are identified throughout this manuscript in parentheses
immediately following the first-generation name. Concatenating
these multi-generation markers resulted in a final panel of 
35 discrete MS markers. Further, genomic location with re- 
spect to presence within intergenic or intragenic regions 
was determined. Of the 35 markers, 20 were located in 
known or hypothetical genes, while only 15 were located in 
non-coding intergenic regions. The repeat type also varied, 
with 26 MS markers identified as having perfect repeats 
and nine with non-perfect repeats. Additional file 1 de- 
scribes each of the MS loci analysed in this study. 

Data consolidation 

All previously published data, including genetic diversity 
indices (i.e., number of alleles per locus, expected
heterozygosity (He) and repeat length), MS parameters
(i.e., location, repeat type and motif length) and
population parameters (i.e., regional location, annual
parasite incidence (API) and sample sizes), were assembled
from 18 different global studies (representing seven



regions, 14 countries, and 35 field sites) into a single 
database to facilitate statistical analysis and modelling 
using JMP® Genomics 6.0 (©2012 SAS Institute Inc,
Cary, NC, USA). See Additional file 2 for a summary of 
studies included in the analysis. 
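
For readers who wish to recompute the diversity indices from raw allele calls, the sketch below derives the number of alleles and the expected heterozygosity for one locus. It assumes the sample-size-corrected estimator He = [n/(n - 1)][1 - sum(p_i^2)], which is widely used for haploid Plasmodium MS data; the individual source studies may have used a different estimator, and the function name and allele sizes are hypothetical.

```python
from collections import Counter

def diversity_indices(allele_calls):
    """Return (number of alleles, expected heterozygosity) for one locus.

    allele_calls: one allele size per isolate (dominant allele only).
    Assumes He = n/(n-1) * (1 - sum(p_i^2)); this estimator is an assumption.
    """
    n = len(allele_calls)
    if n < 2:
        raise ValueError("at least two isolates are required")
    counts = Counter(allele_calls)
    sum_p_squared = sum((c / n) ** 2 for c in counts.values())
    he = (n / (n - 1)) * (1.0 - sum_p_squared)
    return len(counts), he

# Hypothetical fragment sizes (bp) for ten isolates at a single locus.
calls = [201, 201, 204, 207, 207, 207, 210, 213, 213, 216]
n_alleles, he = diversity_indices(calls)
```

A locus at which every isolate carries the same allele yields He = 0, while He approaches 1 as alleles become more numerous and more evenly distributed.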

Given the fact that genetic diversity is a function of 
endemicity, it was essential to establish endemicity cat- 
egories to stratify downstream analyses. However, the re- 
ported metrics for calculating malaria incidence varied 
extensively across the global regions examined in this 
study. In an effort to accommodate this variation, all 
metrics were simplified by converting them to the "annual
parasite incidence" (API: the number of microscopically
confirmed malaria cases during one year per 1,000
population) for the period in which the samples were
collected for each study. Previously described methods for
classifying endemicity [45] were utilized to transform
the numerical API values into categories
(<0.05 stratum, hypo-endemic with typically focal
transmission; >0.05 stratum, meso- to hyper-endemic) to
facilitate data analysis.
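
The conversion and stratification described above can be sketched as follows. This is an illustrative outline only: the function names and the site data are hypothetical, and the handling of a value exactly equal to the 0.05 cutoff is an assumption, since the text does not specify it.

```python
def annual_parasite_incidence(confirmed_cases: int, population: int) -> float:
    """Microscopically confirmed cases in one year per 1,000 population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 1000.0 * confirmed_cases / population

def api_stratum(api: float, cutoff: float = 0.05) -> str:
    """Map a numerical API value to one of the two strata used here."""
    # The text does not say how a value exactly equal to the cutoff is
    # handled; placing it in the lower stratum is an assumption.
    return ">0.05" if api > cutoff else "<0.05"

# Hypothetical field sites: (confirmed cases in the year, population).
sites = {"site_A": (3, 120_000), "site_B": (450, 60_000)}
strata = {name: api_stratum(annual_parasite_incidence(cases, pop))
          for name, (cases, pop) in sites.items()}
```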

Defining the polymorphic potential of individual MS 

The objective of this study was to identify quality MS 
markers in order to generate a refined panel of markers 
that will provide a balanced picture of P. vivax genomic 
diversity. Given the fact that the polymorphic potential of 
each marker can generate unequal variation [33-35], statis- 
tical modelling was employed to down-select markers with 
extreme variation. Number of alleles, expected heterozygosity
(He) and/or repeat lengths in excess of the mean
may indicate unregulated polymorphic potential, with 
heightened heterogeneity that can obscure downstream 
population parameter estimations (Figure 1). Although 
MS markers in excess of the mean may not directly trans- 
late into distinct and observable patterns within the para- 
site population structure, these markers can be used as 
tools to define the MOI. Conversely, a reduction of alleles, 
He and/or repeat length may not provide a strong enough
signal to discern population structure when it does exist 
(Figure 1). Though no studies have indicated an overall re- 
duction in MS diversity, this is expected to become more 
of a concern in regions with elimination platforms, as di- 
versity decreases with reduced transmission. Markers in 
significant excess of the mean are termed "Excess", those 
significantly reduced from the mean are termed "Re- 
duced", and those with no difference from the mean are 
termed "Balanced". For these reasons, markers that deviate 
significantly from the mean in either direction were down- 
selected from the final core panel of markers, which can 
be used to clearly define the population structure without 
bias from excess or reduced diversity (Figure 2). In all 
cases, individual MS were analysed in step-down whole 
model linear regression and standard least squares fit 



Sutton Malaria Journal 2013, 12:447 
http://www.malariajournal.com/content/12/1/447









[Figure 1 panels: (i) Reduced diversity: limited population structure; (ii) Excess diversity: compounded population structure; (iii) Balanced diversity: decipherable population structure]

Figure 1 Population structure scenarios based on the polymorphic potential of microsatellite markers. (i) Markers generating reduced
diversity due to reduced mutability will not be able to resolve existing population structure; (ii) markers generating excess diversity due to 
increased mutability may confound local population structure, making it difficult to compare different geographic regions; (iii) markers with 
balanced diversity calibrated to the population comparisons of interest, can decipher population structure and provide meaningful insight into 
parasite migration and evolution. 



models, both stratified by API, to identify markers with 
values significantly different from the mean. 
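
The Excess/Reduced/Balanced labelling can be illustrated with a deliberately simplified stand-in for the regression models. The sketch below is not the JMP procedure used in this study: it assumes a two-sided z-test of one marker's mean against the pooled mean of all markers within a single API stratum, and all names and values are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def classify_marker(marker_values, pooled_values, alpha=0.05):
    """Label one marker 'Excess', 'Reduced' or 'Balanced' within one API stratum.

    Simplified stand-in for the regression models: a two-sided z-test of the
    marker mean against the pooled mean of all markers (an assumption).
    """
    if len(marker_values) < 2:
        raise ValueError("need at least two observations for the marker")
    difference = mean(marker_values) - mean(pooled_values)
    standard_error = stdev(marker_values) / len(marker_values) ** 0.5
    if standard_error == 0:
        # No spread in the marker: any non-zero difference counts as a deviation.
        p_value = 1.0 if difference == 0 else 0.0
    else:
        z = difference / standard_error
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    if p_value >= alpha:
        return "Balanced"
    return "Excess" if difference > 0 else "Reduced"

# Hypothetical numbers of alleles per locus reported for three markers.
balanced_marker = [7, 9, 8, 10, 6, 8]
high_marker = [15, 17, 16]
low_marker = [2, 3, 2]
pooled = balanced_marker + high_marker + low_marker
```

With so few observations per marker a normal approximation is crude; the sketch is meant only to show how the three labels follow from the direction and significance of the deviation from the mean.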

Results and discussion 

MS diversity as a function of endemicity 

The amount of genetic diversity within a region is a 
function of parasite incidence [46-48], and high quality 
MS markers should reflect this relationship (Figure 3a). 
To test the overall link between diversity and endem- 
icity, diversity indices for all MS markers across all glo- 
bal studies were correlated with the categorical API 
strata. For the API >0.05 stratum, the mean number of
alleles per locus (x̄ = 11.4, σ = 11.9, 95% CI = 9.4, 13.4),
mean He (x̄ = 0.79, σ = 0.15, 95% CI = 0.76, 0.81) and
mean maximum repeat length (x̄ = 36.3, σ = 19.8, 95%
CI = 33.0, 40.0) were significantly higher than the mean
number of alleles per locus (x̄ = 6.7, σ = 5.4, 95% CI =
6.0, 7.4), mean He (x̄ = 0.63, σ = 0.24, 95% CI = 0.60,
0.66) and mean maximum repeat length (x̄ = 31.8,
σ = 18.9, 95% CI = 29.4, 34.2) in the API <0.05 stratum
(p <0.0001, p <0.0001 and p = 0.0294, ANOVA,
respectively) (Figure 3b). This confirms that genetic diversity is
a function of parasite endemicity, as regions with greater 
endemicity are expected to have a greater repertoire of 
genetically diverse parasites circulating in the popula- 
tion. Individual analysis for each MS, including down- 
selection data and panel recommendations, can be found 
in Figure 2. 
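
The stratum comparison reported above (mean, standard deviation, 95% CI and ANOVA) can be sketched with the Python standard library. The data are hypothetical, the confidence interval uses a normal approximation, and only the F statistic is computed; obtaining a p-value would require an F distribution, which the standard library does not provide.

```python
from statistics import NormalDist, mean, stdev

def summarize(values, confidence=0.95):
    """Return (mean, SD, (CI low, CI high)) using a normal approximation."""
    m, sd = mean(values), stdev(values)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sd / len(values) ** 0.5
    return m, sd, (m - half_width, m + half_width)

def anova_f(*groups):
    """One-way ANOVA F statistic (between-group / within-group mean squares)."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical numbers of alleles per locus in each API stratum.
high_api = [12, 9, 15, 11, 14, 10]
low_api = [6, 7, 5, 8, 6, 4]
f_statistic = anova_f(high_api, low_api)
```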

Polymorphic potential of repetitive regions 
Understanding the role of microsatellite parameters on diversity

Earlier reports considering the polymorphic potential of 
P. vivax MS identified differences in motif length and re- 
peat length as likely causes for allelic variation between 
MS markers [37,41,43]; however, much of this discussion
took place before the publication of the draft genome
[25] and subsequent whole genome sequencing projects 



[44,49]. In other organisms, like fruit flies, humans and 
chimpanzees, researchers have found that certain motifs, 
based on their length and nucleotide composition, have 
higher rates of mutability than others, suggesting that 
repeat length is an intrinsic function of motif mutability 
[36,38-40]. However, the circularity of this hypothesis is
difficult to break, and one cannot help but question the
root cause of increased mutability, as the size of the
repeat must in part be a function of motif mutability.
Here, both motif length and repeat length are
re-investigated, as well as the genomic location of the
tandem repeat (intergenic versus intragenic) and the re- 
peat type (perfect, interrupted, compound or interrupted 
and compound) as likely factors for MS mutability. 

Motif length as a function of MS diversity 

The 35 markers included in this study displayed five dif- 
ferent motif lengths: di- (n = 2), tri- (n = 18), tetra- (n = 
8), hepta- (n = 2), and octa- (n = 2) nucleotide. Though 
the hepta- and octa-nucleotide motifs are not true 
microsatellites, but rather minisatellites, the use of these 
markers in more than one field site warrants consideration 
in this analysis. Of these five motif lengths, tri-nucleotide 
motifs revealed the most dynamic range of polymorphic 
potential, with the largest range of alleles (range = 1-103), 
He (range = 0.01-0.99) and maximum repeat length
(range = 10-87). Octa-nucleotide motifs revealed the
most conservative polymorphic potential, with the
smallest range of alleles (range = 2-13), He (range = 0.01-0.9)
and maximum repeat length (range = 10-17) (Figure 4),
although a larger sample size is required for adequate
power to be confident in this result.

Next, a linear regression was used to determine the
relationship between motif length and the mean number of
alleles per locus, He and maximum repeat length for all MS,
stratified by API. There were no significant correlations
between motif length and the mean number of alleles per
locus or He in either API stratum (Figure 4). However, in
Summary of results for individual analyses

MS marker       Endemicity         Motif length       Repeat length      Repeat type        Genomic position   Recommendation  Chr. (priority)  Ref.
                vs. diversity      vs. diversity      vs. diversity      vs. diversity      vs. diversity
PvMS7           Balanced           Balanced           Balanced           Balanced           Balanced           1° Panel        2(A)             [31]
3.502           Balanced           Balanced           Balanced           Balanced           Balanced           1° Panel        3(A)             [28]
3.503           Balanced           Balanced           Balanced           Balanced           Balanced           1° Panel        3(B)             [28]
MS1             Balanced           Balanced           Balanced           Balanced           Balanced           1° Panel        3(C)             [30]
ms033 (PvMS5)   Balanced           Balanced           Balanced           Balanced           Balanced           1° Panel        3(D)             [29, 31]
MS12            Balanced           Balanced           Balanced           Balanced           Balanced           1° Panel        5(A)             [30]
MS15            Balanced           Balanced           Balanced           Balanced           Balanced           1° Panel        [not recovered]  [30]
MS4 (ms050)     Balanced           Balanced           Balanced           Balanced           Balanced           1° Panel        6(A)             [29, 30]
ms038 (PvMS9)   Balanced           Balanced           Balanced           Balanced           Balanced           1° Panel        6(B)             [29, 31]
MS9 (Pv6635)    Balanced           Balanced           Balanced           Balanced           Balanced           1° Panel        8(A)             [29, 32]
MS20 (ms116)    Balanced           Balanced           Balanced           Balanced           Balanced           1° Panel        10(A)            [29, 30]
MS6             Balanced           Balanced           Balanced           Balanced           Balanced           1° Panel        11(A)            [30]
11.162          Balanced           Balanced           Balanced           Balanced           Balanced           1° Panel        11(B)            [28]
MS10            Balanced           Balanced           Balanced           Balanced           Balanced           1° Panel        13(A)            [30]
13.239          Balanced           Balanced           Balanced           Balanced           Balanced           1° Panel        13(B)            [28]
PvMS8           Balanced           Balanced           Balanced           Balanced           Balanced           1° Panel        13(C)            [31]
14.297          Balanced           Balanced           Balanced           Balanced           Balanced           1° Panel        14(A)            [28]
PvMS6           Balanced           Balanced           Balanced           Balanced           Balanced           1° Panel        14(B)            [31]
PvMS2           Balanced           Reduced (p<0.03)   Balanced           Balanced           Balanced           2° Panel        3(A)             [31]
MS2             Balanced           Balanced           Balanced           Balanced           Excess (p<0.02)    2° Panel        6(A)             [30]
MS5             Balanced           Balanced           Reduced (p<0.05)   Balanced           Balanced           2° Panel        6(B)             [30]
6.34            Balanced           Balanced           Balanced           Balanced           Reduced (p<0.05)   2° Panel        6(C)             [28]
PvMS4           Balanced           Excess (p<0.003)   Balanced           Balanced           Balanced           2° Panel        6(D)             [31]
ms196 (PvMS3)   Balanced           Excess (p<0.04)    Balanced           Balanced           Balanced           2° Panel        8(A)             [29, 31]
Pvsal1814       Balanced           Excess (p<0.02)    Balanced           Balanced           Balanced           2° Panel        14(A)            [32]
3.27            Excess (p<0.02)    Excess (p<0.005)   Excess (p<0.004)   Excess (p<0.02)    [not recovered]    MOI             3(A)             [28]
8.504           Excess (p<0.03)    Balanced           Balanced           Excess (p<0.03)    Excess (p<0.01)    MOI             8(A)             [28]
PvMS11          Excess (p<0.03)    Balanced           Excess (p<0.006)   Excess (p<0.01)    Balanced           MOI             8(B)             [31]
MS16            Excess (p<0.0001)  Excess (p<0.0004)  Excess (p<0.05)    Balanced           Excess (p<0.001)   MOI             9(A)             [30]
MS8 (ms206)     Excess (p<0.03)    Excess (p<0.05)    Excess (p<0.03)    Balanced           Balanced           MOI             12(A)            [29, 30]
1.501           Balanced           Reduced (p<0.02)   Balanced           Balanced           Reduced (p<0.02)   Exclude         1(A)             [28]
MS3             Reduced (p<0.05)   Balanced           Balanced           Balanced           Reduced (p<0.02)   Exclude         4(A)             [30]
PvMS10          Balanced           Reduced (p<0.003)  Reduced (p<0.01)   Reduced (p<0.01)   Balanced           Exclude         5(A)             [31]
MS7             Reduced (p<0.01)   Reduced (p<0.006)  Reduced (p<0.05)   Reduced (p<0.02)   Reduced (p<0.02)   Exclude         12(A)            [30]
PvMS1           Balanced           Reduced (p<0.04)   Balanced           Balanced           Reduced (p<0.02)   Exclude         12(B)            [31]


Figure 2 Summary of statistically validated P. vivax MS for usage in population diversity and MOI studies. MS diversity indices (mean 
number of alleles per locus, expected heterozygosity, and maximum repeat length) were correlated with six microsatellite and population 
parameters (endemicity, motif length, repeat length, repeat type and genomic position) to identify MS with excess, reduced or balanced diversity 
in comparison with the mean. Balanced in both API categories (API >0.05 or API <0.05) is indicated in green. Excess or reduced in API >0.05, 
API <0.05 or both is indicated in white, black and gray, respectively. Based on this data, MS were categorized into four recommended groups: 
1° Panel, 2° Panel, Exclude, and MOI. "1° Panel" indicates balanced diversity in all six test categories and usage as the primary panel of markers for 
measuring population diversity. "2° Panel" indicates markers with significant excess or reduction of diversity in one of six test categories. These 
markers should be used cautiously, as they may misrepresent the diversity level due to inherent unbalanced mutability. "Exclude" indicates 
markers with significant reduction in diversity in more than one of the six test categories. These markers are not recommended, as they 
consistently result in a misrepresentation of population diversity due to reduced polymorphic potential. "MOI" (multiplicity of infection) indicates 
MS markers that consistently have significant excess diversity in more than one test category. MOI markers are ideal for identifying multiclonal 
infections. For chromosomes with more than one MS marker tested, priority has been assigned (A-D). Priority is based on the total number of 
studies that have utilized the marker, with a higher priority being placed on markers that have been used more frequently. Bold font indicates 
markers of highest priority. 
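
The grouping rules stated in this legend can be written as a small decision function. This is a sketch under the assumption that the five per-test outcomes are the only inputs; combinations the legend does not describe (for example, one excess and one reduced result) are returned as "Unclassified", and the function name is hypothetical.

```python
def recommend(results):
    """Map per-test outcomes ('Balanced', 'Excess', 'Reduced') to a panel group.

    Follows the grouping rules stated in the Figure 2 legend; outcome
    combinations the legend does not cover return 'Unclassified'.
    """
    excess = sum(r == "Excess" for r in results)
    reduced = sum(r == "Reduced" for r in results)
    if excess == 0 and reduced == 0:
        return "1° Panel"
    if excess + reduced == 1:
        return "2° Panel"
    if reduced > 1 and excess == 0:
        return "Exclude"
    if excess > 1 and reduced == 0:
        return "MOI"
    return "Unclassified"

# Outcomes copied from four rows of Figure 2 (p-values omitted).
pvms7 = ["Balanced"] * 5
ms5 = ["Balanced", "Balanced", "Reduced", "Balanced", "Balanced"]
ms16 = ["Excess", "Excess", "Excess", "Balanced", "Excess"]
ms7 = ["Reduced"] * 5
```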
[Figure 3 labels: (a) incidence/endemicity/transmission; (b) x-axis, Annual Parasite Incidence: >0.05 (Meso-High), ≤0.05 (Low and Focal)]



Figure 3 a, b MS diversity is a function of parasite endemicity. (a) Schematic illustrating that population diversity increases as a function of 
increased endemicity, captured by parasite incidence and transmission; (b) for all MS markers combined, the box plots compare the mean 
number (no) of alleles per locus (y-axis, blue), expected heterozygosity (He) (y-axis, red) and maximum repeat length (y-axis, green) between
different API categories (x-axis). Character values (A and B) denote statistical significance between API strata (p <0.0001, ANOVA). 



the <0.05 API stratum there was a significant negative cor- 
relation between motif length and mean maximum repeat 
length (p <0.0001, ANOVA, bivariate fit), suggesting that 
shorter motif lengths may generate an increased number 
of repeats; however, this is not reflected in the number of 
alleles per locus (Figure 4). This correlation exists only as 
a trend for the >0.05 API stratum (p = 0.349, ANOVA,
bivariate fit), likely due to the limited usage of hepta- and
octa-nucleotide motifs in regions of higher endemicity
(Figure 4). Regardless, the negative correlation between motif
length and repeat length establishes the motif structure 
as an important factor to be considered when selecting 
MS markers for genetic diversity studies. Individual 
analysis for each MS, including down-selection data 
and panel recommendations, can be found in Figure 2. 
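
A bivariate fit of the kind reported above can be sketched as an ordinary least-squares line with its Pearson correlation coefficient. The data below are hypothetical and chosen only to show a negative slope; the sketch does not reproduce the published p-values, which would additionally require an F or t distribution.

```python
def bivariate_fit(x, y):
    """Ordinary least-squares fit y = intercept + slope * x, plus Pearson r."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have equal length of at least 3")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain some variation")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r

# Hypothetical loci: motif length (nucleotides) and maximum repeat length.
motif_length = [2, 3, 3, 4, 4, 7, 8]
max_repeats = [60, 45, 50, 30, 35, 15, 12]
slope, intercept, r = bivariate_fit(motif_length, max_repeats)
```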

Repeat length as a function of MS diversity 

Previous studies have reported that the mutability of the 
repeat region may be guided by the repeat length, as in- 
creased replication slippage is probable in sequences 
with high repeat numbers [21,38,41-43]. In this study, 
repeat length was highly variable, ranging from seven to 
87 repeats across the 35 MS markers. Statistical modelling 
was used to correlate the mean number of alleles per locus 
and He with the mean maximum number of repeats in the
repeat array across all studies, stratified by API. In
both API strata, API <0.05 and API >0.05, both the number of
alleles per locus (p = 0.0011 and p = 0.0240, ANOVA,
bivariate fit, respectively) and He (p <0.0001 and p = 0.0064,
ANOVA, bivariate fit, respectively) showed a significant
positive linear correlation with increasing repeat
length (Figure 5). These results confirm previous work
by Russell et al. [37] and provide additional insight into 



the maintenance of tandem repeats, as the parasites are 
transmitted at different rates in regions of differing
endemicity. Individual analysis for each MS, including 
down-selection data and panel recommendations, can 
be found in Figure 2. 

Repeat type as a function of MS diversity 

Sequence analysis of MS loci has revealed that MS may 
exist in either perfect or non-perfect types. Perfect 
microsatellites will have a repeated motif that continues 
uninterrupted for a specific repeat length, while non- 
perfect microsatellites may exist as imperfect, inter- 
rupted or compound repeats. Although hard evidence is 
lacking for the cause of these non-perfect repeat types, 
the generation of single point mutations within a MS 
motif may offer some explanation for imperfect repeats, 
while interrupted repeat types may be caused by inser- 
tion mutations and compound repeat types may be the 
result of recombinatory events. Regardless of the mech- 
anistic cause, the mutability of these different repeat 
types is of considerable interest as it may assist in the se- 
lection of quality MS loci for population diversity studies. 
As mentioned in the Methods section, of the 35 markers 
examined in this study, 26 MS markers were identified as 
having perfect repeats and nine were defined as non- 
perfect (either imperfect, interrupted or compound). For 
the purpose of this analysis the repeat type, limited to per- 
fect versus non-perfect repeat types, was correlated with 
the mean number of alleles per locus, He and maximum
repeat length (stratified by API). 

On the most basic level, non-perfect repeat types ap- 
pear to be associated with increased diversity in all di- 
versity indices, regardless of API stratification. In the 
Figure 4 Motif length as a function of MS diversity. For all MS markers combined, the box plots compare the mean number (no) of alleles
per locus (y-axis, blue), expected heterozygosity (He) (y-axis, red) and maximum repeat length (y-axis, green) across all motif lengths (primary x-axis),
stratified by API (secondary x-axis). There were no significant correlations between motif length and the mean no of alleles per locus (panels i, ii) or mean He
(panels iii, iv) in either API stratum. However, in the <0.05 API stratum there was a significant negative correlation between motif length and
mean maximum repeat length (panel vi) (p <0.0001, ANOVA, bivariate fit). This correlation exists only as a trend for the >0.05 API stratum (panel v)
(p = 0.349, ANOVA, bivariate fit).



API <0.05 stratum, the mean maximum repeat length 
for the non-perfect repeats (x̄ = 48.3, σ = ±17.7, 95%
CI = 44.3, 52.3) was significantly higher than the mean
for perfect repeats (x̄ = 23.9, σ = ±13.6, 95% CI = 22.0,
26.0) (p <0.0001, ANOVA) (Figure 6). A similar observation
was made in the API >0.05 stratum, where the
mean maximum repeat length for the non-perfect repeats
(x̄ = 49.6, σ = ±19.3, 95% CI = 44.5, 54.8) was
also significantly higher than that for perfect repeats
(x̄ = 27.1, σ = ±14.1, 95% CI = 24.0, 30.2) (p <0.0001,
ANOVA) (Figure 6). Significance between these repeat
types was also achieved when considering the mean
number of alleles per locus and the He in the API >0.05
stratum. Non-perfect repeats had a significantly higher
mean number of alleles per locus (x̄ = 16.2, σ = ±16.5,
95% CI = 11.8, 20.6) and He (x̄ = 0.82, σ = ±0.15, 95%
CI = 0.78, 0.86) when compared with the mean number
of alleles per locus (x̄ = 8.1, σ = ±5.2, 95% CI = 6.9, 9.2)
(p <0.0001, ANOVA) and He (x̄ = 0.77, σ = ±0.14, 95%
CI = 0.73, 0.80) (p = 0.0192, ANOVA) of the perfect
repeats (Figure 6).

However, it seems counterintuitive that non-perfect 
repeats might generate increased diversity levels in these 
populations. Further investigation of these repeat types 
revealed that when compared with perfect repeats (x̄ =
4.0, σ = ±1.9, 95% CI = 3.8, 4.2), non-perfect repeats (x̄ =
3.1, σ = ±0.31, 95% CI = 3.1, 3.2) are significantly biased
towards smaller motif lengths (p <0.0001, ANOVA), 
which were previously found to be associated with in- 
creased diversity levels in the population. Therefore, it is 
more likely that the increased diversity found to be asso- 
ciated with non-perfect repeats is a byproduct of the ac- 
tual motif structure or the combination of different 
repeating motifs when the non-perfect repeat is a 
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Figure 5 Repeat length as a function of MS diversity. For all MS markers combined, the box plots compare the mean number (no) of alleles 
per locus (y-axis, blue) and expected heterozygosity (H e ) (y-axis, red) across repeat length (x 1o -axis), stratified API (x 2o -axis). In both API strata, 
API <0.05 (panels ii, iv) and API >0.05 (panels i, iii), the mean no of alleles per locus (p = 0.001 1 and p = 0.0240, ANOVA, bivariate fit, 
respectively) and mean H e (p <0.0001 and p = 0.0064, ANOVA, bivariate fit, respectively) were positively correlated with the repeat length. 



compound type. Individual analysis for each MS, includ- 
ing down-selection data and panel recommendations, 
can be found in Figure 2. 

Genomic position as a function of MS diversity 

The proximity of a MS to a coding region in the genome 
will likely influence the polymorphic potential within the 
locus. For example, recent studies in P. falciparum have 
indicated that H e in MS is inversely correlated with the 
proximity of the MS locus to the P. falciparum chloro- 
quine resistance transporter gene, which is known to be 
associated with chloroquine resistance in this parasite 
[50-53]. This relationship is likely a result of genetic 
hitchhiking, but is still important to consider when 
selecting MS loci to describe genetic diversity in a popu- 
lation as it may impact the polymorphic potential. As 
previously mentioned, of the 35 markers examined in 
this study, 20 were located in known (N = 8) or hypo- 
thetical genes (N = 12), while only 15 were located in 
non-coding intergenic regions. For the purpose of this 
analysis the genomic position was correlated with the 
mean number of alleles per locus, H e and maximum re- 
peat length (stratified by API). 



Figure 6 Repeat type as a function of MS diversity. For all MS markers combined, the box plots compare the mean number (no) of alleles per 

For both API strata, there were no significant differences
in the mean number of alleles per locus or
mean H e between intergenic and intragenic regions. However,
in both API strata, API <0.05 and API >0.05, the
mean maximum repeat length did vary significantly
between intergenic (x̄ = 24.0, σ = ±13.5, 95% CI = 21.2, 26.7;
x̄ = 31.1, σ = ±13.0, 95% CI = 27.2, 35.0, respectively) and intragenic
(x̄ = 36.8, σ = ±20.1, 95% CI = 33.6, 40.1; x̄ = 38.8, σ = ±22.0,
95% CI = 34.2, 43.3, respectively) regions, with intragenic loci
having significantly higher numbers of repeats than intergenic
regions (p <0.0001 and p = 0.0303 for API <0.05 and
API >0.05, respectively, ANOVA) (Figure 7). To help explain
this finding, genomic position was correlated with
repeat type (perfect versus non-perfect) and motif length.
In this study, a significantly higher proportion of intragenic MS
than of intergenic MS were non-perfect repeat types
(44.1% versus 23.3%, respectively; p <0.0001,
Fisher's exact test). Likewise, these highly diverse intragenic
markers are significantly biased towards smaller motif
lengths (p <0.0001, ANOVA). The absence of an increase
in the mean number of alleles or H e would likely
negate the possibility that these intragenic regions have
increased polymorphic potential, but again, this analysis
revealed that there is a fundamental association between
the structure of the motif and the amount of genetic diversity
present in the MS. Individual analysis for each MS, including
down-selection data and panel recommendations,
can be found in Figure 2.
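
The comparison of repeat types between intragenic and intergenic MS above relies on Fisher's exact test for a 2 x 2 table. The sketch below shows how a two-sided p value for such a table can be computed from the hypergeometric distribution. It is an illustration only: the function name is hypothetical, the underlying counts for the 44.1% and 23.3% proportions are not given in this section, and the table used here contains toy counts that are not study data.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the 2 x 2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Toy table (NOT study data): rows are two marker groups,
# columns are non-perfect versus perfect repeat counts.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(round(p_value, 4))
```

For the toy table the result is 400/11440, or about 0.035. With the real per-group counts in place of the toy values, the same function would give the exact test for the intragenic versus intergenic comparison.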

Conclusions 

Genetic diversity data were mined from 18 population 
diversity studies (Additional file 1) in an effort to evalu- 
ate the quality of data generated from published P. vivax 
MS markers (N = 42, reduced to N = 35 after NCBI Pri- 
mer Blast indicated redundancies) and also to produce 
recommended MS panels for both population diversity 
and MOI studies. Though there is a convention among 
population diversity studies to select MS markers with 
extremely high polymorphic potential, there are MS with 
increased and decreased mutation rates that will falsely 
inflate and deflate the genetic diversity of the parasite population,
respectively. Therefore, when considering individual MS,
markers may generate excess, reduced or
balanced (no difference) diversity when compared with
the mean across all markers (Figure 1). Given the inher- 
ent unequal MS mutability [33-35], data quality was ex- 
amined by using robust step-down statistical models 
that compared the genetic diversity metrics (number of 
alleles per locus, H e and maximum repeat lengths) of in- 
dividual MS markers with the mean of all MS markers 
(stratified by API) to examine the impact of parasite
endemicity, motif length, repeat length, repeat type and
genomic position on MS diversity. Individual
analysis for each MS, including down-selection data
and panel recommendations, can be found in Figure 2.

As expected, the results indicated that the amount of
genetic diversity present within all global regions is a
function of parasite endemicity; individual MS analysis
revealed that five of the 35 markers were in significant
excess of the mean, while two were significantly reduced
from the mean (Figure 2, Figure 3a,b). Other factors,
such as the motif length and repeat length, were also
significantly correlated with the amount of diversity present
within individual MS markers. Compared with longer
motifs, shorter motifs were associated with increased
genetic diversity; six MS markers were in significant
excess of the mean, while five were significantly reduced
(Figure 2, Figure 4). Longer repeat lengths, rather than
shorter repeat lengths, were positively correlated with
greater diversity; four of the MS markers were in significant
excess of the mean, while three were significantly
reduced (Figure 2, Figure 5). Further, non-perfect repeats
and intragenic MS also correlated significantly with
increased genetic diversity. For repeat type, there were
three MS in significant excess of the mean and two
significantly reduced from the mean (Figure 2, Figure 6),
while MS location revealed four in significant excess
of the mean and five significantly reduced from the
mean (Figure 2, Figure 7). However, non-perfect repeat
types were biased towards being located within intragenic
regions, and shorter motifs with longer repeat
lengths tended to comprise both non-perfect repeats
and intragenic MS. Therefore, it is difficult to completely
resolve the total impact of these MS parameters
on genetic diversity.

Figure 7 Genomic position as a function of MS diversity. For all MS markers combined, the box plots compare the mean number (no) of 
alleles per locus (y-axis, blue), expected heterozygosity (H e ) (y-axis, red) and maximum repeat length (y-axis, green) across different genomic 
positions (x 1°-axis), stratified by API (x 2°-axis). Character values (A and B) denote statistical significance between intergenic and intragenic locations 
within each API category. For both API strata, there were no significant differences in the mean no of alleles per locus (panels i, ii) or 
mean H e (panels iii, iv) between intergenic and intragenic regions. In both API strata, API <0.05 (panel vi) and API >0.05 (panel v), the mean 
maximum repeat length did vary significantly between intergenic and intragenic regions, with intragenic loci having significantly higher numbers of 
repeats than intergenic regions (p <0.0001 and p = 0.0303 for API <0.05 and API >0.05, respectively, ANOVA). 

The availability of a validated refined panel of MS
markers will greatly facilitate the development of improved
comparative population genetics algorithms, which
will in turn generate a better understanding of the migration
and evolution of this species. Based on the analyses in
this study, MS markers have been categorized into four 
groups: (1) 1° Panel, (2) 2° Panel, (3) Excluded and (4) 
MOI (Figure 2). For chromosomes with more than one 
MS marker tested, a priority ranking has been assigned 
(A-D) (Figure 2). Priority is based on the total number of 
studies that have utilized the marker, with a higher priority 
being placed on markers that have been used more fre- 
quently. "1° Panel" (N = 18) indicates balanced diversity in 
all test categories and usage as the primary panel of 
markers for decoding population diversity and structure. 
It is recommended that future studies utilize MS markers 
with "A" priority ranking (N = 9, bold font) to facilitate 
population diversity comparison between global regions, 
as these markers have previously been used with the high- 
est frequency. "2° Panel" (N = 7) indicates significant ex- 
cess or reduction in diversity in one test category. It is 
recommended that the 2° Panel markers be used cau- 
tiously as additional markers to the 1° Panel, as the re- 
sulting population structure may be skewed towards 
decreased or increased diversity due to the inherent unbal- 
anced mutability of the MS marker. "Exclude" (N = 5) in- 
dicates significant reduction in diversity in more than one 
test category. If selected for a population diversity study, it 
is recommended that the data be thoroughly scrutinized, 
as these markers will result in a skewed interpretation of 
population diversity due to the reduced polymorphic po- 
tential of these MS markers. "MOI" (N = 5) indicates MS 
markers that consistently have significant excess diversity 
in more than one test category. MOI markers are highly 
recommended for identifying multiclonal infections. Two 
of these five MS markers (3.27 and MS8 (ms206), bold 
font) are highly recommended for MOI studies due to 
having extreme excess diversity in more than one test cat- 
egory across both API strata. 
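
The four-group categorization described above can be sketched as a simple decision rule. This is a hedged reading of the text rather than the procedure actually used to build Figure 2: the function name, the outcome labels and the "Unresolved" fallback are hypothetical, and the text does not say how a marker with excess in one test category and reduction in another (or with both in more than one) was handled.

```python
from collections import Counter

# Outcome of one marker in one test category, relative to the mean of all markers.
EXCESS, REDUCED, BALANCED = "excess", "reduced", "balanced"


def classify_marker(outcomes):
    """Assign a panel group from a mapping of test category -> outcome."""
    counts = Counter(outcomes.values())
    n_excess, n_reduced = counts[EXCESS], counts[REDUCED]

    if n_excess == 0 and n_reduced == 0:
        return "1° Panel"   # balanced diversity in all test categories
    if n_excess + n_reduced == 1:
        return "2° Panel"   # excess or reduction in exactly one test category
    if n_reduced > 1 and n_excess <= 1:
        return "Excluded"   # reduction in more than one test category
    if n_excess > 1 and n_reduced <= 1:
        return "MOI"        # excess in more than one test category
    # Assumption: mixed patterns are not covered by the stated rules.
    return "Unresolved"


# Hypothetical marker: excess diversity in two of the five test categories.
example = {
    "endemicity": EXCESS,
    "motif_length": EXCESS,
    "repeat_length": BALANCED,
    "repeat_type": BALANCED,
    "genomic_position": BALANCED,
}
print(classify_marker(example))
```

The priority ranking (A-D) is a separate step based on how many studies have used each marker, so it is not modelled here.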

Additional file 2: Plasmodium vivax microsatellite marker panels 
used in global studies. 